A Neural Approach to Automated Essay Scoring


Kaveh Taghipour and Hwee Tou Ng
Department of Computer Science
National University of Singapore

13 Computing Drive

Singapore 117417

{kaveh, nght}@comp.nus.edu.sg


Abstract

Traditional automated essay scoring systems rely on carefully designed features to evaluate and score essays. The performance of such systems is tightly bound to the quality of the underlying features. However, it is laborious to manually design the most informative features for such a system. In this paper, we develop an approach based on recurrent neural networks to learn the relation between an essay and its assigned score, without any feature engineering. We explore several neural network models for the task of automated essay scoring and perform some analysis to gain insights into the models. The results show that our best system, which is based on long short-term memory networks, outperforms a strong baseline by 5.6% in terms of quadratic weighted Kappa, without requiring any feature engineering.


  1. Introduction

There is a recent surge of interest in neural networks, which are based on continuous-space representation of the input and non-linear functions. Hence, neural networks are capable of modeling complex patterns in data. Moreover, since these methods do not depend on manual engineering of features, they can be applied to solve problems in an end-to-end fashion. SENNA (Collobert et al., 2011) and neural machine translation (Bahdanau et al., 2015) are two notable examples in natural language processing that operate without any external task-specific knowledge. In this paper, we report a system based on neural networks to take advantage of their modeling capacity


    and generalization power for the automated essay scoring (AES) task.

Essay writing is usually a part of the student assessment process. Several organizations, such as Educational Testing Service (ETS)1, evaluate the writing skills of students in their examinations. Because of the large number of students participating in these exams, grading all essays is very time-consuming. Thus, some organizations have been using AES systems to reduce the time and cost of scoring essays.

Automated essay scoring refers to the process of grading student essays without human interference. An AES system takes as input an essay written for a given prompt, and then assigns a numeric score to the essay reflecting its quality, based on its content, grammar, and organization. Such AES systems are usually based on regression methods applied to a set of carefully designed features. The process of feature engineering is the most difficult part of building AES systems. Moreover, it is challenging for humans to consider all the factors that are involved in assigning a score to an essay.

Our AES system, on the other hand, learns the features and relation between an essay and its score automatically. Since the system is based on recurrent neural networks, it can effectively encode the information required for essay evaluation and learn the complex patterns in the data through non-linear neural layers. Our system is among the first AES systems based on neural networks designed without any hand-crafted features. Our results show that our system outperforms a strong baseline and


    1https://www.ets.org





achieves state-of-the-art performance in automated essay scoring. In order to make it easier for other researchers to replicate our results, we have made the source code of our system publicly available2.

The rest of this paper is organized as follows. Section 2 gives an overview of related work in the literature. Section 3 describes the automated essay scoring task and the evaluation metric used in this paper. We provide the details of our approach in Section 4, and present and discuss the results of our experimental evaluation in Section 5. Finally, we conclude the paper in Section 6.

  2. Related Work

There exist many automated essay scoring systems (Shermis and Burstein, 2013) and some of them are being used in high-stakes assessments. e-rater (Attali and Burstein, 2004) and Intelligent Essay Assessor (Foltz et al., 1999) are two notable examples of AES systems. In 2012, a competition on automated essay scoring called ‘Automated Student Assessment Prize’ (ASAP)3 was organized by Kaggle and sponsored by the Hewlett Foundation. A comprehensive comparison of AES systems was made in the ASAP competition. Although many AES systems have been developed to date, they have been built with hand-crafted features and supervised machine learning algorithms.

Researchers have devoted a substantial amount of effort to design effective features for automated essay scoring. These features can be as simple as essay length (Chen and He, 2013) or more complicated such as lexical complexity, grammaticality of a text (Attali and Burstein, 2004), or syntactic features (Chen and He, 2013). Readability features (Zesch et al., 2015) have also been proposed in the literature as another source of information. Moreover, text coherence has also been exploited to assess the flow of information and argumentation of an essay (Chen and He, 2013). A detailed overview of the features used in AES systems can be found in (Zesch et al., 2015). Moreover, some attempts have been made to address different aspects of essay writing independently. For example, argument strength and organization of essays have been tackled by some


    2https://github.com/nusnlp/nea

    3https://www.kaggle.com/c/asap-aes

    researchers through designing task-specific features for each aspect (Persing et al., 2010; Persing and Ng, 2015).

Our system, however, accepts an essay text as input directly and learns the features automatically from the data. To do so, we have developed a method based on recurrent neural networks to score the essays in an end-to-end manner. We have explored a variety of neural network models in this paper to identify the most suitable model. Our best model is a long short-term memory neural network (Hochreiter and Schmidhuber, 1997) and is trained as a regression method. Similar recurrent neural network approaches have recently been used successfully in a number of other NLP tasks. For example, Bahdanau et al. (2015) have proposed an attentive neural approach to machine translation based on gated recurrent units (Cho et al., 2014). Neural approaches have also been used for syntactic parsing. In (Vinyals et al., 2015), long short-term memory networks have been used to obtain parse trees by using a sequence-to-sequence model and formulating the parsing task as a sequence generation problem. Apart from these examples, recurrent neural networks have also been used for opinion mining (Irsoy and Cardie, 2014), sequence labeling (Ma and Hovy, 2016), language modeling (Kim et al., 2016; Sundermeyer et al., 2015), etc.


  3. Automated Essay Scoring

In this section, we define the automated essay scoring task and the evaluation metric used for assessing the quality of AES systems.


    1. Task Description

Automated essay scoring systems are used in evaluating and scoring student essays written based on a given prompt. The performance of these systems is assessed by comparing their scores assigned to a set of essays to human-assigned gold-standard scores. Since the output of AES systems is usually a real-valued number, the task is often addressed as a supervised machine learning task (mostly by regression or preference ranking). Machine learning algorithms are used to learn the relationship between the essays and reference scores.

    2. Evaluation Metric

The output of an AES system can be compared to the ratings assigned by human annotators using various measures of correlation or agreement (Yannakoudakis and Cummins, 2015). These measures include Pearson’s correlation, Spearman’s correlation, Kendall’s Tau, and quadratic weighted Kappa (QWK). The ASAP competition adopted QWK as the official evaluation metric. Since we use the ASAP data set for evaluation in this paper, we also use QWK as the evaluation metric in our experiments.

Quadratic weighted Kappa is calculated as follows. First, a weight matrix W is constructed according to Equation 1:

W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}    (1)

where i and j are the reference rating (assigned by a human annotator) and the hypothesis rating (assigned by an AES system), respectively, and N is the number of possible ratings. A matrix O is calculated such that O_{i,j} denotes the number of essays that receive a rating i by the human annotator and a rating j by the AES system. An expected count matrix E is calculated as the outer product of histogram vectors of the two (reference and hypothesis) ratings. The matrix E is then normalized such that the sum of elements in E and the sum of elements in O are the same. Finally, given the matrices O and E, the QWK score is calculated according to Equation 2:

\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}    (2)

In our experiments, we compare the QWK score of our system to well-established baselines. We also perform a one-tailed paired t-test to determine whether the obtained improvement is statistically significant.
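For concreteness, the following sketch implements the QWK computation described above in NumPy. It is not taken from the paper or its released code; the function name and the integer-rating interface are our own choices.

import numpy as np

def quadratic_weighted_kappa(human, system, min_rating, max_rating):
    """QWK as described above: weight matrix W, observed matrix O, and
    expected matrix E (outer product of rating histograms, normalized
    to the same total count as O)."""
    n = max_rating - min_rating + 1
    human = np.asarray(human) - min_rating
    system = np.asarray(system) - min_rating

    # Weight matrix W_{i,j} = (i - j)^2 / (N - 1)^2   (Equation 1)
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    # Observed count matrix O_{i,j}
    O = np.zeros((n, n))
    for h, s in zip(human, system):
        O[h, s] += 1

    # Expected counts: outer product of histograms, scaled so that sum(E) == sum(O)
    E = np.outer(np.bincount(human, minlength=n),
                 np.bincount(system, minlength=n)).astype(float)
    E *= O.sum() / E.sum()

    # Equation 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Example: two raters on a 0-3 scale
print(quadratic_weighted_kappa([0, 2, 3, 1], [0, 2, 2, 1], 0, 3))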

  4. A Recurrent Neural Network Approach

Recurrent neural networks are one of the most successful machine learning models and have attracted the attention of researchers from various fields. Compared to feed-forward neural networks, recurrent neural networks are theoretically more powerful and are capable of learning more complex patterns from data. Therefore, we have mainly focused on recurrent networks in this paper. This section gives a description of the recurrent neural network architecture that we have used for the essay scoring task and the training process.

    1. Model Architecture

The neural network architecture that we have used in this paper is illustrated in Figure 1. We now describe each layer in our neural network in detail.

Lookup Table Layer: The first layer of our neural network projects each word into a d_{LT}-dimensional space. Given a sequence of words W represented by their one-hot representations (w_1, w_2, \cdots, w_M), the output of the lookup table layer is calculated by Equation 3:

LT(W) = (E \cdot w_1, E \cdot w_2, \cdots, E \cdot w_M)    (3)

where E is the word embeddings matrix and will be learnt during training.

Convolution Layer: Once the dense representation of the input sequence W is calculated, it is fed into the recurrent layer of the network. However, it might be beneficial for the network to extract local features from the sequence before applying the recurrent operation. This optional characteristic can be achieved by applying a convolution layer on the output of the lookup table layer. In order to extract local features from the sequence, the convolution layer applies a linear transformation to all M windows in the given sequence of vectors4. Given a window of dense word representations x_1, x_2, \cdots, x_l, the convolution layer first concatenates these vectors to form a vector \tilde{x} of length l \cdot d_{LT} and then uses Equation 4 to calculate the output vector of length d_c:


Conv(\tilde{x}) = W \cdot \tilde{x} + b    (4)

    In Equation 4, W and b are the parameters of the network and are shared across all windows in the sequence.


4The number of input vectors and the number of output vectors of the convolution layer are the same because we pad the sequence to avoid losing border windows.
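To make Equation 4 concrete, the following NumPy sketch applies the shared linear map to every (zero-padded) window of word vectors. The variable names and the padding scheme are assumptions; the released implementation may realize the convolution differently.

import numpy as np

def conv_layer(X, W, b, window=3):
    """X: (M, d_LT) dense word vectors; W: (d_c, window * d_LT); b: (d_c,).
    Returns (M, d_c): one feature vector per window (Equation 4)."""
    M, d_lt = X.shape
    pad = window // 2
    # Pad with zero vectors so border windows are not lost (cf. footnote 4).
    X_pad = np.vstack([np.zeros((pad, d_lt)), X, np.zeros((pad, d_lt))])
    out = np.empty((M, W.shape[0]))
    for t in range(M):
        x_tilde = X_pad[t:t + window].reshape(-1)  # concatenate l vectors
        out[t] = W @ x_tilde + b                   # Conv(x_tilde) = W.x_tilde + b
    return out

# Toy example: M = 5 words, d_LT = 4, d_c = 2, window l = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W = rng.normal(size=(2, 3 * 4))
print(conv_layer(X, W, np.zeros(2)).shape)  # (5, 2)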

Figure 1: The convolutional recurrent neural network architecture. From bottom to top: a lookup table layer over the input words w_1, w_2, ..., w_M, a convolution layer, a recurrent layer, a mean-over-time layer, and a linear layer with sigmoid activation that outputs the score.


The convolution layer can be seen as a function that extracts feature vectors from n-grams. Since this layer provides n-gram level information to the subsequent layers of the neural network, it can potentially capture local contextual dependencies in the essay and consequently improve the performance of the system.

Recurrent Layer: After generating embeddings (whether from the convolution layer or directly from the lookup table layer), the recurrent layer starts processing the input to generate a representation for the given essay. This representation should ideally encode all the information required for grading the essay. However, since the essays are usually long, consisting of hundreds of words, the learnt vector representation might not be sufficient for accurate scoring. For this reason, we preserve all the intermediate states of the recurrent layer to keep track of the important bits of information from processing the essay. We experimented with basic recurrent units (RNN) (Elman, 1990), gated recurrent units (GRU) (Cho et al., 2014), and long short-term memory units (LSTM) (Hochreiter and Schmidhuber, 1997) to identify the best choice for our task. Since LSTM outperforms the other two units, we only describe LSTM in this section.

Long short-term memory units are modified recurrent units that can cope with the problem of vanishing gradients more effectively (Pascanu et al., 2013). LSTMs can learn to preserve or forget the information required for the final representation. In order to control the flow of information during processing of the input sequence, LSTM units make use of three gates to discard (forget) or pass the information through time. The following equations formally describe the LSTM function:

i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)
f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)
\tilde{c}_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)
c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}
o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)
h_t = o_t \circ \tanh(c_t)    (5)

x_t and h_t are the input and output vectors at time t, respectively. W_i, W_f, W_c, W_o, U_i, U_f, U_c, and U_o are weight matrices and b_i, b_f, b_c, and b_o are bias vectors. The symbol \circ denotes element-wise multiplication and \sigma represents the sigmoid function.
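The recurrence in Equation 5 can be transcribed almost literally; the sketch below is an illustrative NumPy version of a single LSTM step (in practice a library LSTM implementation would be used), with the parameters collected in a dictionary for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equation 5. p holds the weight matrices W_*, U_*
    and bias vectors b_*. Iterating over all timesteps and keeping every h_t
    yields the states consumed by the mean-over-time layer below."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c_t = i_t * c_tilde + f_t * c_prev          # element-wise gating
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t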

Mean over Time: The outputs of the recurrent layer, H = (h_1, h_2, \cdots, h_M), are fed into the mean-over-time layer. This layer receives M vectors of length d_r as input and calculates an average vector of the same length. This layer's function is defined in Equation 6:


MoT(H) = \frac{1}{M} \sum_{t=1}^{M} h_t    (6)

The mean-over-time layer is responsible for aggregating the variable number of inputs into a fixed-length vector. Once this vector is calculated, it is fed into the linear layer to be mapped into a score.

Instead of taking the mean of the intermediate recurrent layer states h_t, we could use the last state vector h_M to compute the score and remove the mean-over-time layer. However, as we will show in Section 5.2, it is much more effective to use the mean-over-time layer and take all recurrent states into account.

Linear Layer with Sigmoid Activation: The linear layer maps its input vector generated by the mean-over-time layer to a scalar value. This mapping is simply a linear transformation of the input vector and therefore, the computed value is not bounded. Since we need a bounded value in the range of valid scores for each prompt, we apply a sigmoid function to limit the possible scores to the range of (0, 1). The mapping of the linear layer after applying the sigmoid activation function is given by Equation 7:


s(x) = sigmoid(w \cdot x + b)    (7)

    where x is the input vector (MoT (H)), w is the weight vector, and b is the bias value.

We normalize all gold-standard scores to [0, 1] and use them to train the network. However, during testing, we rescale the output of the network to the original score range and use the rescaled scores to evaluate the system.
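Putting the layers together, the following sketch approximates the LSTM variant of the architecture in Figure 1 in PyTorch, using the embedding and recurrent dimensions reported in Section 5.1. It is not the authors' released implementation (https://github.com/nusnlp/nea); in particular, where dropout is applied and how padding is masked are assumptions.

import torch
import torch.nn as nn

class NeuralEssayScorer(nn.Module):
    """Lookup table -> LSTM -> mean over time -> linear layer with sigmoid,
    i.e. the LSTM variant of the architecture in Figure 1."""
    def __init__(self, vocab_size, emb_dim=50, rnn_dim=300, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, rnn_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(rnn_dim, 1)

    def forward(self, word_ids):
        # word_ids: (batch, M) token indices, padded with index 0
        emb = self.embed(word_ids)                       # (batch, M, emb_dim)
        states, _ = self.lstm(emb)                       # all intermediate h_t
        # Mean over time, ignoring padding positions (cf. footnote 7).
        mask = (word_ids != 0).unsqueeze(-1).float()
        mot = (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        mot = self.dropout(mot)
        return torch.sigmoid(self.out(mot)).squeeze(-1)  # score in (0, 1)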

    2. Training

We use the RMSProp optimization algorithm (Dauphin et al., 2015) to minimize the mean squared error (MSE) loss function over the training data. Given N training essays and their corresponding normalized gold-standard scores s^*_i, the model computes the predicted scores s_i for all training essays and then updates the network parameters such that the mean squared error is minimized. The loss function is shown in Equation 8:

MSE(s, s^*) = \frac{1}{N} \sum_{i=1}^{N} (s_i - s^*_i)^2    (8)

Additionally, we make use of dropout regularization to avoid overfitting. We also clip the gradient if the norm of the gradient is larger than a threshold.

We do not use any early stopping methods. Instead, we train the neural network model for a fixed number of epochs and monitor the performance of the model on the development set after each epoch. Once training is finished, we select the model with the best QWK score on the development set.
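A minimal training loop consistent with this description might look as follows: RMSProp with the MSE loss of Equation 8, gradient-norm clipping, and selection of the epoch with the best development-set QWK. The hyper-parameter values are those reported in Section 5.1; evaluate_qwk is a hypothetical helper (e.g., built on the QWK sketch in Section 3), and the data-loader interface is an assumption.

import copy
import torch
import torch.nn as nn

def train(model, train_loader, dev_loader, epochs=50, clip=10.0):
    optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
    loss_fn = nn.MSELoss()
    best_qwk, best_state = -1.0, None
    for epoch in range(epochs):
        model.train()
        for word_ids, gold in train_loader:        # gold scores normalized to [0, 1]
            optimizer.zero_grad()
            loss = loss_fn(model(word_ids), gold)  # Equation 8
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
        # No early stopping: keep the parameters of the epoch with the best dev QWK.
        qwk = evaluate_qwk(model, dev_loader)      # hypothetical helper
        if qwk > best_qwk:
            best_qwk, best_state = qwk, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model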

  5. Experiments

In this section, we describe our experimental setup and present the results. Moreover, an analysis of the results and some discussion are provided in this section.

    1. Setup

The dataset that we have used in our experiments is the same dataset used in the ASAP competition run by Kaggle (see Table 1 for some statistics). We use quadratic weighted Kappa as the evaluation metric, following the ASAP competition. Since the test set used in the competition is not publicly available, we use 5-fold cross validation to evaluate our systems. In each fold, 60% of the data is used as our training set, 20% as the development set, and 20% as the test set. We train the model for a fixed number of epochs and then choose the best model based on the development set. We tokenize the essays using the NLTK5 tokenizer, lowercase the text, and normalize the gold-standard scores to the range of [0, 1]. During testing, we rescale the system-generated normalized scores to the original range of scores and measure the performance.
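The preprocessing just described could be sketched as follows, assuming NLTK's word tokenizer and per-prompt score ranges as in Table 1; the exact pipeline of the released system may differ.

import nltk

def preprocess(text):
    """Lowercase and tokenize an essay with the NLTK tokenizer."""
    return nltk.word_tokenize(text.lower())

def normalize_score(score, low, high):
    """Map a gold-standard score to [0, 1] for training."""
    return (score - low) / (high - low)

def rescale_score(norm_score, low, high):
    """Map a system score in [0, 1] back to the original range for evaluation."""
    return low + norm_score * (high - low)

# Example for prompt 1, whose scores range from 2 to 12 (Table 1)
print(normalize_score(8, 2, 12))   # 0.6
print(rescale_score(0.6, 2, 12))   # 8.0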


Prompt   #Essays   Avg length   Scores
1        1,783     350          2-12
2        1,800     350          1-6
3        1,726     150          0-3
4        1,772     150          0-3
5        1,805     150          0-4
6        1,800     150          0-4
7        1,569     250          0-30
8        723       650          0-60

Table 1: Statistics of the ASAP dataset.


In order to evaluate the performance of our system, we compare it to a publicly available open-source6 AES system called ‘Enhanced AI Scoring Engine’ (EASE).


      5http://www.nltk.org

      6https://github.com/edx/ease

This system is the best open-source system that participated in the ASAP competition, and was ranked third among all 154 participating teams. EASE is based on hand-crafted features and regression methods. The features that are extracted by EASE can be categorized into four classes:

      • Length-based features

      • Parts-of-Speech (POS)

      • Word overlap with the prompt

      • Bag of n-grams

        After extracting the features, a regression algorithm is used to build a model based on the training data. The details of the features and the results of using support vector regression (SVR) and Bayesian linear ridge regression (BLRR) are reported in (Phandi et al., 2015). We use these two regression methods as our baseline systems.

Our system has several hyper-parameters that need to be set. We use the RMSProp optimizer with decay rate (ρ) set to 0.9 to train the network and we set the base learning rate to 0.001. The mini-batch size is 32 in our experiments7 and we train the network for 50 epochs. The vocabulary is the 4,000 most frequent words in the training data and all other words are mapped to a special token that represents unknown words. We regularize the network by using dropout (Srivastava et al., 2014) and we set the dropout probability to 0.5. During training, the norm of the gradient is clipped to a maximum value of 10. We set the word embedding dimension (d_{LT}) to 50 and the output dimension of the recurrent layer (d_r) to 300. If a convolution layer is used, the window size (l) is set to 3 and the output dimension of this layer (d_c) is set to 50. Finally, we initialize the lookup table layer using pre-trained word embeddings8 released by Zou et al. (2013). Moreover, the bias value of the linear layer is initialized such that the network's output before training is almost equal to the average score in the training data.

7To create mini-batches for training, we pad all essays in a mini-batch using a dummy token to make them have the same length. To eliminate the effect of padding tokens during training, we mask them to prevent the network from miscalculating the gradients.

        8http://ai.stanford.edu/wzou/mt
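A sketch of the vocabulary construction and the mini-batch padding described above (and in footnote 7); reserving index 0 for padding and index 1 for unknown words is our own bookkeeping convention, not a detail reported in the paper.

from collections import Counter

PAD, UNK = 0, 1

def build_vocab(tokenized_essays, size=4000):
    """Keep the `size` most frequent training words; everything else maps to UNK."""
    counts = Counter(tok for essay in tokenized_essays for tok in essay)
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for word, _ in counts.most_common(size):
        vocab[word] = len(vocab)
    return vocab

def to_padded_batch(tokenized_essays, vocab):
    """Convert a mini-batch of essays to equal-length index sequences.
    Padding positions (index 0) are masked out inside the network."""
    ids = [[vocab.get(tok, UNK) for tok in essay] for essay in tokenized_essays]
    max_len = max(len(seq) for seq in ids)
    lengths = [len(seq) for seq in ids]
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in ids]
    return padded, lengths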

We have performed several experiments to identify the best model architecture for our task. These architectural choices are summarized below:

      • Convolutional vs. recurrent neural network

      • RNN unit type (basic RNN, GRU, or LSTM)

      • Using mean-over-time over all recurrent states vs. using only the last recurrent state

• Using mean-over-time vs. an attention mechanism

• Using a recurrent layer vs. a convolutional recurrent layer

      • Unidirectional vs. bidirectional LSTM

We have used 8 Tesla K80 GPUs to perform our experiments in parallel.

    2. Results and Discussion

In this section, we present the results of our evaluation by comparing our system to the above-mentioned baselines (SVR and BLRR). Table 2 (rows 1 to 4) shows the QWK scores of our systems on the eight prompts from the ASAP dataset9. This table also contains the results of our statistical significance tests. The baseline that we have used for hypothesis testing is EASE (BLRR), and the statistically significant improvements (p < 0.05) over this baseline are marked with ‘*’. It should be noted that all neural network models in Table 2 are unidirectional and include the mean-over-time layer. Except for the CNN model, the convolution layer is not included in the networks.

According to Table 2, all model variations are able to learn the task properly and perform competitively compared to the baselines. However, LSTM performs significantly better than all other systems and outperforms the baseline by a large margin (4.1%). In contrast, basic RNN falls behind other models and does not perform as accurately as GRU or LSTM.


9To aggregate the QWK scores of all prompts, Fisher transformation was used in the ASAP competition before averaging QWK scores. However, we found that applying Fisher transformation only slightly changes the scores. (If we apply this method to aggregate QWK scores, our best ensemble system (row 7, Table 2) would obtain a QWK score of 0.768.) Therefore we simply take the average of QWK scores across prompts.

ID   Systems          1       2       3       4       5       6       7       8       Avg QWK
1    CNN              0.797   0.634   0.646   0.767   0.746   0.757   0.746   0.687   0.722
2    RNN              0.687   0.633   0.552   0.744   0.732   0.757   0.743   0.553   0.675
3    GRU              0.616   0.591   0.668   0.787   0.795   0.800   0.752   0.573   0.698
4    LSTM             0.775   0.687   0.683   0.795   0.818   0.813   0.805   0.594   0.746*
5    CNN (10 runs)    0.804   0.656   0.637   0.762   0.752   0.765   0.750   0.680   0.726*
6    LSTM (10 runs)   0.808   0.697   0.689   0.805   0.818   0.827   0.811   0.598   0.756*
7    (5) + (6)        0.821   0.688   0.694   0.805   0.807   0.819   0.808   0.644   0.761*
8    EASE (SVR)       0.781   0.621   0.630   0.749   0.782   0.771   0.727   0.534   0.699
9    EASE (BLRR)      0.761   0.606   0.621   0.742   0.784   0.775   0.730   0.617   0.705

Table 2: The QWK scores of the various neural network models and the baselines on prompts 1-8. The baseline for the statistical significance tests is EASE (BLRR), and statistically significant improvements (p < 0.05) over it are marked with ‘*’.


This behaviour is probably because of the relatively long sequences of words in essays. GRU and LSTM have been shown to ‘remember’ sequences and long-term dependencies much more effectively and therefore, we believe this is the reason behind RNN's relatively poor performance.

Additionally, we perform some experiments to evaluate ensembles of our systems. We create variants of our network by training with different random initializations of the parameters. To combine these models, we simply take the average of the scores predicted by these networks. This approach is shown to improve performance by reducing the variance of the model, thereby making the predictions more accurate. Table 2 (rows 5 and 6) shows the results of CNN and LSTM ensembles over 10 runs. Moreover, we combine CNN ensembles and LSTM ensembles together to make the predictions (row 7).

As shown in Table 2, ensembles of models always lead to improvements. We obtain 0.4% and 1.0% improvement from CNN and LSTM ensembles, respectively. However, our best model (row 7 in Table 2) is the ensemble of 10 instances of CNN models and 10 instances of LSTM models and outperforms the baseline BLRR system by 5.6%.
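Combining the runs amounts to averaging their normalized predictions, e.g. (a sketch; models is assumed to be a list of trained networks such as the PyTorch model sketched in Section 4):

import numpy as np
import torch

def ensemble_predict(models, word_ids):
    """Average the normalized scores predicted by independently trained models
    (e.g., 10 CNN runs plus 10 LSTM runs)."""
    with torch.no_grad():
        preds = [model(word_ids).numpy() for model in models]
    return np.mean(preds, axis=0)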

It is possible to use the last state of the recurrent layer to predict the score instead of taking the mean over all intermediate states. In order to observe the effects of this architectural choice, we test the network with and without the mean-over-time layer. The results of this experiment are presented in Table 3, clearly showing that the neural network fails to learn the task properly in the absence of the mean-over-time layer. When the mean-over-time layer is not used in the model, the network needs to efficiently encode the whole essay into a single state vector and then use it to predict the score. However, when the mean-over-time layer is included, the model has direct access to all intermediate states and can recall the required intermediate information much more effectively and therefore is able to predict the score more accurately.


Systems          Avg QWK
LSTM             0.746*
LSTM w/o MoT     0.540
LSTM+attention   0.731*
CNN+LSTM         0.708
BLSTM            0.699
EASE (SVR)       0.699
EASE (BLRR)      0.705

Table 3: The QWK scores of LSTM neural network variants. The baseline for the statistical significance tests is EASE (BLRR), and statistically significant improvements (p < 0.05) over it are marked with ‘*’.

Additionally, we experiment with three other neural network architectures. Instead of using mean-over-time to average intermediate states, we use an attention mechanism (Bahdanau et al., 2015) to compute a weighted sum of the states. In this case, we calculate the dot product of the intermediate states and a vector trained by the neural network, and then apply a softmax operation to obtain the normalized weights. Another alternative is to add a convolution layer before feeding the embeddings to the recurrent LSTM layer (CNN+LSTM) and evaluate the model. We also use a bidirectional LSTM model (BLSTM), in which the sequence of words is processed in both directions and the intermediate states generated by both LSTM layers are merged and then fed into the mean-over-time layer. The results of testing these architectures are summarized in Table 3.
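The dot-product attention just described can be sketched as follows; the learned vector v and the pooling interface are assumptions consistent with the description rather than the exact released implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, v):
    """H: (M, d_r) intermediate recurrent states; v: (d_r,) trained vector.
    Returns a weighted sum of the states instead of their uniform mean."""
    weights = softmax(H @ v)   # dot products -> softmax-normalized weights
    return weights @ H         # (d_r,)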

The attention mechanism significantly improves the results compared to LSTM without mean-over-time, but it does not perform as well as LSTM with mean-over-time. The other two architectural choices do not lead to further improvements over the LSTM neural network. This observation is in line with the findings of some other researchers (Kadlec et al., 2015) and is probably because of the relatively small number of training examples compared to the capacity of the models.

We have also compared the accuracy of our best system (shown as ‘AES’) with human performance, presented in Table 4. To do so, we calculate the agreement (QWK scores) between our system and each of the two human annotators separately (‘AES - H1’ and ‘AES - H2’), as well as the agreement between the two human annotators (‘H1 - H2’). According to Table 4, the performance of our system on average is very close to human annotators. In fact, for some of the prompts, the agreement between our system and the human annotators is even higher than the agreement between human annotators. In general, we can conclude that our method is just below the upper limit and approaching human-level performance.

We also compare our system to a recently published automated essay scoring method based on neural networks (Alikaniotis et al., 2016). Instead of performing cross validation, Alikaniotis et al. (2016) partition the ASAP dataset into two parts by using 80% of the data for training and the remaining 20% for testing. For comparison, we also carry out an experiment on the same training and test data used in (Alikaniotis et al., 2016). Following how QWK scores are computed in Alikaniotis et al. (2016), instead of calculating QWK for each prompt separately and averaging them, we calculate the QWK score for the whole test set, by setting the minimum score to 0 and the maximum score to 60. Using this evaluation setup, our LSTM system achieves a QWK score of 0.987, higher than the QWK score of 0.96 of the best system in (Alikaniotis et al., 2016). In this way of calculating QWK scores, since the majority of the test essays have a much smaller score range (see Table 1) compared to [0, 60], the differences between the system-predicted scores and the gold-standard scores will be small most of the time. For example, more than 55% of the essays in the test set have a score range of [0, 3] or [0, 4] and therefore, for these prompts, the differences between human-assigned gold-standard scores and the scores predicted by an AES system will be small in the range of [0, 60]. For this reason, in contrast to prompt-specific QWK calculation, the QWK scores are much higher in this evaluation setting and far exceed the QWK score for human agreement when computed in a prompt-specific way (see Table 4).
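The difference between the two evaluation setups can be illustrated with an off-the-shelf quadratically weighted kappa, e.g. scikit-learn's cohen_kappa_score; the helper functions below are our own, and the essay data is of course not reproduced here.

from sklearn.metrics import cohen_kappa_score

def prompt_specific_qwk(gold_by_prompt, pred_by_prompt):
    """Average of per-prompt QWK scores (the evaluation used in this paper)."""
    scores = [
        cohen_kappa_score(gold, pred, weights="quadratic")
        for gold, pred in zip(gold_by_prompt, pred_by_prompt)
    ]
    return sum(scores) / len(scores)

def pooled_qwk(gold_by_prompt, pred_by_prompt):
    """QWK over all essays at once, with ratings treated as values in [0, 60]
    (the evaluation setup of Alikaniotis et al. (2016))."""
    gold = [s for prompt in gold_by_prompt for s in prompt]
    pred = [s for prompt in pred_by_prompt for s in prompt]
    return cohen_kappa_score(gold, pred, weights="quadratic",
                             labels=list(range(61)))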

Interpreting neural network models and the interactions between nodes is not an easy task. However, it is possible to gain an insight into a network by analyzing the behavior of particular nodes. In order to understand how our neural network assigns the scores, we monitor the score variations while testing the model. Figure 2 displays the score variations for three essays after processing each word (at each timestamp) by the neural network. We have selected a poorly written essay, a well written essay, and an average essay with normalized gold-standard scores of 0.2, 0.8, and 0.6, respectively.

Figure 2: Score variations per timestamp. All scores are normalized to the range of [0, 1].

According to Figure 2, the network learns to take essay length into account and assigns a very low score to all short essays with fewer than 50 words, regardless of the content. This pattern recurs for all essays and is not specific to the three selected essays in Figure 2. However, if an essay is long enough, the content becomes more important and the AES system starts discriminating well written essays from poorly written ones.

Description   1       2       3       4       5       6       7       8       Avg QWK
AES - H1      0.750   0.684   0.662   0.759   0.751   0.791   0.731   0.607   0.717
AES - H2      0.767   0.690   0.632   0.762   0.769   0.775   0.752   0.530   0.710
H1 - H2       0.721   0.812   0.769   0.851   0.753   0.776   0.720   0.627   0.754

Table 4: Comparison with human performance on prompts 1-8. H1 and H2 denote human rater 1 and human rater 2, respectively, and AES refers to our best system (ensemble of CNN and LSTM models).


As shown in Figure 2, the model properly assigns a higher score to the well written essay 2, while giving lower scores to the other essays. This observation confirms that the model successfully learns the required features for automated essay scoring. While it is difficult to associate different parts of the neural network model with specific features, it is clear that appropriate indicators of essay quality are being learnt, including essay length and essay content.

  6. Conclusion

In this paper, we have proposed an approach based on recurrent neural networks to tackle the task of automated essay scoring. Our method does not rely on any feature engineering and automatically learns the representations required for the task. We have explored a variety of neural network model architectures for automated essay scoring and have achieved significant improvements over a strong open-source baseline. Our best system outperforms the baseline by 5.6% in terms of quadratic weighted Kappa. Furthermore, an analysis of the network has been performed to gain an insight into the recurrent neural network model, and we show that the method effectively utilizes essay content to extract the required information for scoring essays.

Acknowledgments

This research is supported by Singapore Ministry of Education Academic Research Fund Tier 2 grant MOE2013-T2-1-150. We are also grateful to the anonymous reviewers for their helpful comments.


References

Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Yigal Attali and Jill Burstein. 2004. Automated essay scoring with e-rater v. 2.0. Technical report, Educational Testing Service.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Yann N. Dauphin, Harm de Vries, and Yoshua Bengio. 2015. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems 28.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Peter W. Foltz, Darrell Laham, and Thomas K. Landauer. 1999. The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for Ubuntu corpus dialogs. In Proceedings of the NIPS 2015 Workshop on Machine Learning for Spoken Language Understanding and Interaction.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning.

Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Isaac Persing, Alan Davis, and Vincent Ng. 2010. Modeling organization in student essays. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.

Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Mark D. Shermis and Jill Burstein, editors. 2013. Handbook of Automated Essay Evaluation: Current Applications and New Directions. Routledge.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517–529.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28.

Helen Yannakoudakis and Ronan Cummins. 2015. Evaluating the performance of automated text scoring systems. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications.

Torsten Zesch, Michael Wojatzki, and Dirk Scholten-Akoun. 2015. Task-independent features for automated essay grading. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.